| Column Name | Description | Remark |
|---|---|---|
| CustomerID | Unique ID identifying each customer | String |
| Gender | The customer's gender | Male, Female |
| Age | The customer's current age, in years, at the end of the fiscal quarter | Number |
| Senior Citizen | Whether the customer is 65 or older | Yes, No |
| Married (Partner) | Whether the customer is married | Yes, No |
| Dependents | Whether the customer lives with any dependents | Yes, No |
| Number of Dependents | The number of dependents who live with the customer | Number |
| Phone Service | Whether the customer subscribes to home phone service with the company | Yes, No |
| Multiple Lines | Whether the customer subscribes to multiple phone lines with the company | Yes, No |
| Internet Service | Whether the customer subscribes to Internet service with the company | No, DSL, Fiber Optic, Cable |
| Online Security | Whether the customer subscribes to the company's add-on online security service | Yes, No |
| Online Backup | Whether the customer subscribes to the company's add-on online backup service | Yes, No |
| Device Protection Plan | Whether the customer subscribes to the company's add-on device protection plan for their Internet equipment | Yes, No |
| Premium Tech Support | Whether the customer subscribes to the company's add-on technical support plan with reduced wait times | Yes, No |
| Streaming TV | Whether the customer uses their Internet service to stream TV programs from a third-party provider | Yes, No |
| Streaming Movies | Whether the customer uses their Internet service to stream movies from a third-party provider | Yes, No |
| Contract | The customer's current contract type | Month-to-Month, One Year, Two Year |
| Paperless Billing | Whether the customer has chosen paperless billing | Yes, No |
| Payment Method | How the customer pays their bill | Bank Withdrawal, Credit Card, Mailed Check |
| Monthly Charge | The customer's current total monthly charge for all their services from the company | Number |
| Total Charges | The customer's total charges, calculated to the end of the quarter specified above | Number |
| Tenure | The total number of months the customer has been with the company | Number |
| Churn | Yes = the customer left the company this quarter; No = the customer remained with the company | Yes, No |
Read the dataset.
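A minimal read step could look like the following; the file path and `read.csv` options here are assumptions for illustration, not the report's exact code (only a `../data/` directory appears later in the report).

```r
# Hypothetical path -- adjust to wherever the Telco churn CSV actually lives.
data <- read.csv("../data/raw/telco_churn.csv", stringsAsFactors = FALSE)
```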
introduce: Basic summary of the dataset.

| rows | columns | discrete_columns | continuous_columns | all_missing_columns | total_missing_values | complete_rows | total_observations | memory_usage |
|---|---|---|---|---|---|---|---|---|
| 7043 | 21 | 17 | 4 | 0 | 11 | 7032 | 147903 | 1641832 |
plot_intro: Plot a basic description of the data, including:
columns (features): discrete / continuous, missing columns
rows (customers): complete rows
missing observations
plot_missing: Only display the features that have missing data.
plot_bar: Show the discrete data.
## 1 columns ignored with more than 50 categories.
## customerID: 7043 categories
plot_histogram: Show the continuous data.
plot_qq (without "Churn"): Q-Q plot: compare the data (scatter) against a reference distribution (line).
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
plot_qq (with "Churn"): Q-Q plot: compare the data (scatter) against a reference distribution (line), grouped by "Churn".
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
plot_boxplot: Box plot; outliers are drawn in red.
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
plot_prcomp: PCA (principal component analysis).
## 1 features with more than 50 categories ignored!
## customerID: 7032 categories
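The plot_* helpers above match the DataExplorer package's EDA functions; the calls might be sketched as follows (the `by = "Churn"` grouping is an assumption based on the "with Churn" variants described above).

```r
library(DataExplorer)

introduce(data)                   # tabular summary: rows, columns, missing counts
plot_intro(data)                  # visual version of introduce()
plot_missing(data)                # only features with missing data
plot_bar(data)                    # bar charts of discrete features
plot_histogram(data)              # histograms of continuous features
plot_qq(data)                     # Q-Q plots without grouping
plot_qq(data, by = "Churn")       # Q-Q plots grouped by Churn
plot_boxplot(data, by = "Churn")  # box plots grouped by Churn
plot_prcomp(data)                 # PCA on the usable features
```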
The steps involved in data cleaning:
1. Check whether the data types are correct for each variable using the str() function.
2. Handle missing values:
2.1. Perform KNN (K-Nearest Neighbors) imputation specifically for the "TotalCharges" variable.
3. Standardize the data (convert text to a consistent case):
3.1. Conditionally transform values that start with "N" and replace them with "No".
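The cleaning steps above could be sketched as follows; the choice of VIM::kNN, the value of k, and the column handling are assumptions for illustration, not the report's exact code.

```r
library(VIM)  # assumed package for kNN imputation

# 1. Check variable types
str(data)

# 2.1. kNN imputation for TotalCharges (k = 5 is an assumed default)
data <- kNN(data, variable = "TotalCharges", k = 5, imp_var = FALSE)

# 3.1. Collapse values such as "No internet service" / "No phone service" to "No"
is_chr <- sapply(data, is.character)
data[is_chr] <- lapply(data[is_chr],
                       function(x) ifelse(startsWith(x, "N"), "No", x))
```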
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No" "No" "No" "No" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
The steps involved in feature engineering:
1. Remove specific columns ("customerID", "TotalCharges"):
1.1. "customerID" is an identifier column that doesn't carry useful information for modeling.
1.2. "TotalCharges" is removed because of its high correlation with other columns ("MonthlyCharges" × "tenure" ≈ "TotalCharges"), which could lead to multicollinearity issues in modeling.
2. Encode categorical variables:
2.1. Label Encoding: convert categorical data to numbers where the order matters.
2.2. One-Hot Encoding: convert categorical data to a binary (0 or 1) format.
3. Scale and normalize:
3.1. Min-Max Scaling: scale data to a fixed range of 0 to 1.
4. Oversample the minority class:
4.1. Random Oversampling: new instances are duplicates of existing minority-class instances, increasing the number of samples in the minority class.
5. Feature selection:
5.1. Random Forest importance.
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
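An abridged sketch of the steps above; the package choices (caret, randomForest) and the specific columns touched are assumptions, not the report's exact code.

```r
library(caret)         # assumed, for dummyVars one-hot encoding
library(randomForest)  # assumed, for importance ranking

df$customerID   <- NULL                         # 1.1. drop identifier column
df$TotalCharges <- NULL                         # 1.2. drop correlated column

# 2.1. label-encode the binary Yes/No columns (subset shown for brevity)
for (col in c("Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"))
  df[[col]] <- as.integer(df[[col]] == "Yes")

# 2.2. one-hot encode the multi-level columns
dmy <- dummyVars(~ Contract + PaymentMethod, data = df)
df  <- cbind(df[setdiff(names(df), c("Contract", "PaymentMethod"))],
             predict(dmy, df))

# 3.1. min-max scale a continuous column to [0, 1]
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
df$MonthlyCharges <- minmax(df$MonthlyCharges)

# 4.1. random oversampling: duplicate minority-class rows until classes balance
minority <- df[df$Churn == 1, ]
extra    <- sum(df$Churn == 0) - nrow(minority)
df <- rbind(df, minority[sample(nrow(minority), extra, replace = TRUE), ])

# 5.1. random-forest importance; converting remaining character columns to
# factors and the response to a factor avoids the regression warning above
chr <- sapply(df, is.character)
df[chr] <- lapply(df[chr], as.factor)
rf <- randomForest(as.factor(Churn) ~ ., data = df, importance = TRUE)
importance(rf)
```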
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No" "No" "No" "No" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
## 'data.frame': 7043 obs. of 25 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : int 0 1 1 1 0 0 1 0 0 1 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : int 1 0 0 0 0 0 0 0 1 0 ...
## $ Dependents : int 0 0 0 0 0 0 1 0 0 1 ...
## $ tenure : num -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
## $ PhoneService : int 0 1 1 0 1 1 1 0 1 1 ...
## $ MultipleLines : int 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetService : int 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : int 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackup : int 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtection : int 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupport : int 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTV : int 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMovies : int 0 0 0 0 0 1 0 0 1 0 ...
## $ ContractMonth.to.month : int 1 0 1 0 1 1 1 1 1 0 ...
## $ ContractOne.year : int 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo.year : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBilling : int 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethodBank.transfer..automatic.: int 0 0 0 1 0 0 0 0 0 1 ...
## $ PaymentMethodCredit.card..automatic. : int 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic.check : int 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed.check : int 0 1 1 0 0 0 0 1 0 0 ...
## $ MonthlyCharges : num -1.16 -0.26 -0.363 -0.746 0.197 ...
## $ Churn : int 0 0 1 0 1 1 0 0 1 0 ...
data <- read_csv("../data/feature/feature_set.csv")
n <- nrow(data)
data <- data[sample(n),] # shuffle the rows randomly
index <- createDataPartition(data$Churn, p = 0.8, list = FALSE)
train_data <- data[index, ]
test_data <- data[-index, ]
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[-c(1, which(names(train_data) == "Churn"))]), label = train_data$Churn)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[-c(1, which(names(test_data) == "Churn"))]), label = test_data$Churn)
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
eta = 0.1,
max_depth = 6
)
nrounds <- 100
nfold <- 5
early_stopping_rounds <- 10
verbose <- 1
# using xgb.cv to do k-fold cross validation
cv_result <- xgb.cv(
params = params,
data = train_matrix,
nrounds = nrounds,
nfold = nfold,
early_stopping_rounds = early_stopping_rounds,
verbose = verbose
)
## [1] train-auc:0.860743+0.004023 test-auc:0.828446+0.023039
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
##
## [2] train-auc:0.865658+0.003397 test-auc:0.832031+0.021830
## [3] train-auc:0.869886+0.004021 test-auc:0.835190+0.021159
## [4] train-auc:0.871866+0.003650 test-auc:0.835934+0.020832
## [5] train-auc:0.874667+0.004006 test-auc:0.837381+0.020441
## [6] train-auc:0.876757+0.004224 test-auc:0.837741+0.021015
## [7] train-auc:0.878987+0.004664 test-auc:0.838346+0.020821
## [8] train-auc:0.880340+0.004764 test-auc:0.838527+0.020790
## [9] train-auc:0.881991+0.004253 test-auc:0.838548+0.020479
## [10] train-auc:0.883548+0.004300 test-auc:0.839295+0.020375
## [11] train-auc:0.884952+0.003763 test-auc:0.839602+0.020473
## [12] train-auc:0.886351+0.003854 test-auc:0.840465+0.020549
## [13] train-auc:0.887973+0.003749 test-auc:0.840873+0.020796
## [14] train-auc:0.889495+0.003588 test-auc:0.841287+0.021207
## [15] train-auc:0.891050+0.004001 test-auc:0.841721+0.021061
## [16] train-auc:0.892332+0.004076 test-auc:0.842097+0.021574
## [17] train-auc:0.893663+0.004112 test-auc:0.842224+0.021416
## [18] train-auc:0.895172+0.003788 test-auc:0.841956+0.021221
## [19] train-auc:0.896353+0.004053 test-auc:0.842118+0.021523
## [20] train-auc:0.897585+0.003920 test-auc:0.842256+0.021198
## [21] train-auc:0.898803+0.003980 test-auc:0.842433+0.021233
## [22] train-auc:0.899815+0.004086 test-auc:0.842617+0.021066
## [23] train-auc:0.900759+0.004255 test-auc:0.842591+0.021022
## [24] train-auc:0.901873+0.004339 test-auc:0.842154+0.020760
## [25] train-auc:0.902729+0.004202 test-auc:0.842282+0.020908
## [26] train-auc:0.903614+0.004248 test-auc:0.842561+0.020914
## [27] train-auc:0.904406+0.004190 test-auc:0.842735+0.020948
## [28] train-auc:0.905227+0.004289 test-auc:0.842752+0.021067
## [29] train-auc:0.906209+0.004341 test-auc:0.842606+0.021287
## [30] train-auc:0.907177+0.004456 test-auc:0.842585+0.021178
## [31] train-auc:0.908163+0.004328 test-auc:0.842415+0.021261
## [32] train-auc:0.909006+0.004255 test-auc:0.842405+0.021257
## [33] train-auc:0.909909+0.004513 test-auc:0.842559+0.021393
## [34] train-auc:0.910788+0.004548 test-auc:0.842498+0.021585
## [35] train-auc:0.911612+0.004438 test-auc:0.842211+0.021635
## [36] train-auc:0.912225+0.004474 test-auc:0.842090+0.021642
## [37] train-auc:0.912929+0.004676 test-auc:0.841941+0.021602
## [38] train-auc:0.913426+0.004645 test-auc:0.841943+0.021822
## Stopping. Best iteration:
## [28] train-auc:0.905227+0.004289 test-auc:0.842752+0.021067
print(cv_result)
## ##### xgb.cv 5-folds
## iter train_auc_mean train_auc_std test_auc_mean test_auc_std
## <num> <num> <num> <num> <num>
## 1 0.8607430 0.004023058 0.8284462 0.02303896
## 2 0.8656585 0.003396626 0.8320313 0.02182972
## 3 0.8698862 0.004020688 0.8351896 0.02115889
## 4 0.8718658 0.003650155 0.8359339 0.02083240
## 5 0.8746666 0.004006366 0.8373814 0.02044100
## 6 0.8767569 0.004223795 0.8377409 0.02101495
## 7 0.8789868 0.004663579 0.8383465 0.02082135
## 8 0.8803395 0.004763656 0.8385266 0.02079024
## 9 0.8819908 0.004252530 0.8385476 0.02047856
## 10 0.8835484 0.004299779 0.8392954 0.02037471
## 11 0.8849516 0.003763471 0.8396021 0.02047257
## 12 0.8863507 0.003854375 0.8404646 0.02054875
## 13 0.8879733 0.003748580 0.8408727 0.02079586
## 14 0.8894951 0.003587989 0.8412867 0.02120681
## 15 0.8910499 0.004001213 0.8417206 0.02106110
## 16 0.8923321 0.004076423 0.8420971 0.02157389
## 17 0.8936628 0.004112324 0.8422242 0.02141595
## 18 0.8951718 0.003787858 0.8419558 0.02122072
## 19 0.8963529 0.004053059 0.8421179 0.02152256
## 20 0.8975850 0.003919699 0.8422561 0.02119816
## 21 0.8988031 0.003980489 0.8424331 0.02123267
## 22 0.8998149 0.004086370 0.8426167 0.02106615
## 23 0.9007594 0.004255407 0.8425907 0.02102223
## 24 0.9018729 0.004339329 0.8421536 0.02075988
## 25 0.9027289 0.004201750 0.8422820 0.02090800
## 26 0.9036141 0.004247534 0.8425611 0.02091397
## 27 0.9044057 0.004190053 0.8427351 0.02094780
## 28 0.9052268 0.004289175 0.8427520 0.02106684
## 29 0.9062089 0.004341347 0.8426062 0.02128704
## 30 0.9071769 0.004455786 0.8425846 0.02117822
## 31 0.9081635 0.004327935 0.8424150 0.02126060
## 32 0.9090064 0.004254807 0.8424053 0.02125747
## 33 0.9099088 0.004512705 0.8425587 0.02139308
## 34 0.9107879 0.004547732 0.8424975 0.02158477
## 35 0.9116117 0.004438449 0.8422114 0.02163533
## 36 0.9122248 0.004473660 0.8420899 0.02164200
## 37 0.9129294 0.004675734 0.8419408 0.02160185
## 38 0.9134262 0.004644753 0.8419434 0.02182182
## iter train_auc_mean train_auc_std test_auc_mean test_auc_std
## Best iteration:
## iter train_auc_mean train_auc_std test_auc_mean test_auc_std
## <num> <num> <num> <num> <num>
## 28 0.9052268 0.004289175 0.842752 0.02106684
# Create DMatrix object
data_matrix <- xgb.DMatrix(data = as.matrix(data[-c(1, which(names(data) == "Churn"))]), label = data$Churn)
# get the best iteration
best_nrounds <- cv_result$best_iteration
model <- xgb.train(
params = params,
data = data_matrix,
nrounds = best_nrounds,
verbose = TRUE
)
train_pred <- predict(model, train_matrix)
train_predicted_label <- ifelse(train_pred > 0.5, 1, 0)
test_pred <- predict(model, test_matrix)
test_predicted_label <- ifelse(test_pred > 0.5, 1, 0)
roc_train <- roc(train_data$Churn, train_pred)
roc_test <- roc(test_data$Churn, test_pred)
auc_train <- auc(roc_train)
auc_test <- auc(roc_test)
print(paste("AUC of training data: ", auc_train))
## [1] "AUC of training data: 0.890639729276803"
print(paste("AUC of testing data: ", auc_test))
## [1] "AUC of testing data: 0.888676918185291"
# Null model prediction: using the mean of Churn in the training set as the probability
mean_churn <- mean(train_data$Churn)
null_train_pred <- rep(mean_churn, nrow(train_data))
null_test_pred <- rep(mean_churn, nrow(test_data))
# Compute the ROC of null model
roc_null_train <- roc(train_data$Churn, null_train_pred)
roc_null_test <- roc(test_data$Churn, null_test_pred)
auc_null_train <- auc(roc_null_train)
auc_null_test <- auc(roc_null_test)
# Print AUC values for the null model
print(paste("AUC for null model on training data: ", auc_null_train))
## [1] "AUC for null model on training data: 0.5"
print(paste("AUC for null model on testing data: ", auc_null_test))
## [1] "AUC for null model on testing data: 0.5"
# Compare the two AUCs with DeLong's test
roc_test_xgb <- roc(test_data$Churn, test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_test_null <- roc(test_data$Churn, null_test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# DeLong's test
delong_test <- roc.test(roc_test_xgb, roc_test_null, method="delong")
# Print the test result
print(delong_test)
##
## DeLong's test for two correlated ROC curves
##
## data: roc_test_xgb and roc_test_null
## Z = 43.622, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## 0.3712132 0.4061406
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.8886769 0.5000000
importance_matrix <- xgb.importance(model = model, feature_names = colnames(train_data[-c(1, which(names(train_data) == "Churn"))]))
xgb.plot.importance(importance_matrix)
train_results <- data.frame(customerID = train_data$customerID, label = train_predicted_label, probability = train_pred, groundtruth = train_data$Churn)
test_results <- data.frame(customerID = test_data$customerID, label = test_predicted_label, probability = test_pred, groundtruth = test_data$Churn)
write_csv(train_results, "train_predictions.csv")
write_csv(test_results, "test_predictions.csv")
head(test_results, 5)
## customerID label probability groundtruth
## 1 0530-IJVDB 0 0.10891096 0
## 2 3208-YPIOE 0 0.42432478 1
## 3 3237-AJGEH 1 0.58590657 1
## 4 3211-ILJTT 0 0.35297447 1
## 5 7337-CINUD 0 0.04226052 0
write_csv(importance_matrix, "feature_importance.csv")
head(importance_matrix, 5)
## Feature Gain Cover Frequency Importance
## <char> <num> <num> <num> <num>
## 1: ContractMonth.to.month 0.43888966 0.16064444 0.01791868 0.43888966
## 2: tenure 0.18648048 0.20786294 0.23569952 0.18648048
## 3: InternetService 0.11661886 0.09241762 0.02618884 0.11661886
## 4: MonthlyCharges 0.10358910 0.18545820 0.30048243 0.10358910
## 5: PaymentMethodElectronic.check 0.02191083 0.06834488 0.05375603 0.02191083
xgb.save(model, "../model/churn_prediction_model.xgb")
## [1] TRUE
# Assuming roc_train_plot and roc_test_plot are ggplot objects for ROC curves
roc_train_plot <- ggroc(roc_train) + ggtitle("ROC Curve - Training Data")
roc_test_plot <- ggroc(roc_test) + ggtitle("ROC Curve - Testing Data")
# Save ROC curves
ggsave("roc_train.png", plot = roc_train_plot)
## Saving 7 x 5 in image
ggsave("roc_test.png", plot = roc_test_plot)
## Saving 7 x 5 in image
print(roc_train_plot)
print(roc_test_plot)
Lift analysis is an evaluation technique commonly used in data science and machine learning, particularly in marketing and recommender systems. Its main purpose is to measure how much a strategy, campaign, or model improves an outcome relative to random selection. Key points:
\(Lift\) is a ratio: the rate at which the target behavior occurs when the strategy is applied, divided by the rate at which it occurs without the strategy.
The formula is \[ Lift = \frac{P(B|A)}{P(B)} \]
where \(P(B|A)\) is the probability of \(B\) given condition \(A\), and \(P(B)\) is the unconditional probability of \(B\).
Lift analysis helps determine whether a particular action or model has a positive effect on the outcome, and whether that effect significantly exceeds chance.
(Credit: ChatGPT)
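As a worked instance of the formula above (the numbers are illustrative, not from this dataset): if 26% of the customers a model flags actually churn while the overall churn rate is 13%, the lift is 2, i.e., targeting via the model finds churners at twice the base rate.

```r
p_b_given_a <- 0.26  # P(B|A): churn rate among flagged customers (illustrative)
p_b         <- 0.13  # P(B):   overall churn rate (illustrative)
lift <- p_b_given_a / p_b
lift  # 2
```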
# sorting by probability
test_results <- test_results[order(-test_results$probability),]
head(test_results, 10)
## customerID label probability groundtruth
## 90 7665-TOALD 1 0.8760448 1
## 95 1447-GIQMR 1 0.8760448 1
## 160 1820-TQVEV 1 0.8760448 1
## 377 5134-IKDAY 1 0.8742824 1
## 92 5186-SAMNZ 1 0.8740061 1
## 35 6894-LFHLY 1 0.8727815 1
## 301 2868-MZAGQ 1 0.8720479 1
## 1216 0495-RVCBF 1 0.8720479 1
## 41 9300-AGZNL 1 0.8701169 1
## 413 7274-RTAPZ 1 0.8647000 1
# Segment the customers into deciles
test_results$decile <- cut(test_results$probability, breaks=quantile(test_results$probability, probs=seq(0, 1, by = 0.1)), include.lowest=TRUE, labels=FALSE)
# Reverse the decile numbering (decile 1 = highest predicted probability)
test_results$decile <- 11 - test_results$decile
head(test_results, 5)
## customerID label probability groundtruth decile
## 90 7665-TOALD 1 0.8760448 1 1
## 95 1447-GIQMR 1 0.8760448 1 1
## 160 1820-TQVEV 1 0.8760448 1 1
## 377 5134-IKDAY 1 0.8742824 1 1
## 92 5186-SAMNZ 1 0.8740061 1 1
# Compute the actual response rate and lift for each decile
test_lift_df <- test_results %>%
group_by(decile) %>%
summarise(
count = n(),
num_responses = sum(label),
response_rate = mean(label),
lift = response_rate / mean(data$Churn)
)
plot <- ggplot(test_lift_df, aes(x = as.factor(decile), y = lift)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Lift Chart", x = "Decile", y = "Lift") +
theme_minimal()
# Save the plot with ggsave
ggsave("lift_chart.png", plot, width = 10, height = 6, dpi = 300)
img <- readPNG("lift_chart.png")
print(plot)
# Save the lift data to CSV
write.csv(test_lift_df, "lift_data.csv", row.names = FALSE)
head(test_lift_df, 10)
## # A tibble: 10 × 5
## decile count num_responses response_rate lift
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 141 141 1 3.77
## 2 2 141 141 1 3.77
## 3 3 141 13 0.0922 0.347
## 4 4 140 0 0 0
## 5 5 141 0 0 0
## 6 6 141 0 0 0
## 7 7 140 0 0 0
## 8 8 141 0 0 0
## 9 9 125 0 0 0
## 10 10 157 0 0 0